VLLM Embedding Node
The VLLM Embedding node generates embeddings locally using VLLM's GPU-accelerated inference engine. It supports tensor parallelism for distributing large models across multiple GPUs, quantization for reduced memory usage, and LoRA adapters for fine-tuned model variants. These optimizations enable production-scale deployments with high throughput and low latency.
How It Works
When the node executes, it reads text input from a workflow variable, loads the specified model from HuggingFace Hub or the local cache, initializes the VLLM engine with the configured settings, and runs each text through VLLM's inference pipeline. Each text input produces one embedding vector, an array of floating-point numbers whose dimensionality is determined by the model; the resulting vectors are stored in the output variable.
VLLM provides production-grade performance through continuous batching, PagedAttention memory management, and optimized CUDA kernels. The node supports advanced deployment scenarios including tensor parallelism to distribute large models across multiple GPUs, quantization to reduce memory usage, and CPU offloading for models exceeding GPU memory.
Each output embedding carries a unique identifier (the input ID or a generated UUID) so it can be traced back to its source text. If embedding generation fails for an individual item, processing of the remaining items continues.
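Conceptually, the work the node performs maps onto vLLM's offline `LLM` API roughly as in the sketch below. This is a minimal illustration assuming a recent vLLM release with embedding/pooling support; the model name and texts are examples, not the node's actual implementation.

```python
# Minimal sketch of the kind of vLLM call the node wraps; not the node's
# actual implementation. Assumes a recent vLLM release with pooling support.
from vllm import LLM

# Load an embedding model in embedding ("pooling") mode.
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")

texts = ["What is the capital of France?", "Paris is the capital of France."]
outputs = llm.embed(texts)  # one result per input text, in order

for text, output in zip(texts, outputs):
    vector = output.outputs.embedding  # list of floats; length = model dimension
    print(f"{text[:32]!r} -> {len(vector)}-dimensional embedding")
```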
Configuration Parameters
Input Field
Input Field (Text, Required): Workflow variable containing text to embed.
The node expects a list of embedding request objects where each object contains a type field (set to "text"), an optional id field (string for tracking), and a text field (string content to embed). Single objects are automatically converted to single-item lists.
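For illustration, a valid value for the input variable could look like the following (the ids and texts are arbitrary examples):

```python
# Illustrative input value for the Input Field variable.
embedding_requests = [
    {"type": "text", "id": "doc-001", "text": "VLLM provides fast local inference."},
    {"type": "text", "id": "doc-002", "text": "Embeddings map text to dense vectors."},
]
# A single object such as {"type": "text", "text": "..."} is also accepted and
# is treated as a one-item list.
```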
Output Field
Output Field (Text, Required): Workflow variable where embedding results are stored.
The output is a list of EmbeddingResponse objects where each object contains a uuid field (string identifier matching input ID or generated UUID) and an embeddings field (array of floating-point numbers).
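A corresponding output value could look like the sketch below (vectors are truncated; real embeddings typically have hundreds or thousands of dimensions):

```python
# Illustrative output value stored in the Output Field variable.
embedding_responses = [
    # uuid matches the input id when one was provided...
    {"uuid": "doc-001", "embeddings": [0.0132, -0.0478, 0.0911]},
    # ...otherwise a UUID is generated for the item.
    {"uuid": "3f2b6c1e-9a4d-4a0f-8c2d-7e5b1a9d0c34", "embeddings": [0.0054, 0.0623, -0.0187]},
]
```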
Common naming patterns: text_embeddings, document_vectors, vllm_embeddings, local_embeddings.
Model
Model (Text, Required): HuggingFace model path for embedding generation.
Examples: BAAI/bge-large-en-v1.5, intfloat/e5-large-v2. Models are automatically downloaded from HuggingFace Hub on first use and cached locally. Variable interpolation using ${variable_name} syntax is supported.
Task Type
Task Type (Dropdown, Default: embed): Embedding task type optimization.
- embed - General-purpose embeddings
- retrieval - Optimized for search and RAG applications
- classification - Optimized for categorization tasks
Tensor Parallel Size
Tensor Parallel Size (Number, Optional): Number of GPUs for tensor parallelism.
Distributes the model across multiple GPUs when it does not fit on a single GPU. Set to the number of available GPUs (e.g., 2, 4, 8).
Pipeline Parallel Size
Pipeline Parallel Size (Number, Optional): Number of pipeline parallel stages.
Enables pipeline parallelism for additional model distribution across GPUs.
Data Type
Data Type (Dropdown, Optional): Model weight precision.
- auto - Automatic selection based on model
- half / float16 - 16-bit floating point (faster, less memory)
- bfloat16 - Brain float 16 (better numerical stability)
- float32 - 32-bit floating point (slower, more accurate)
Quantization
Quantization (Dropdown, Optional): Quantization method to reduce memory usage.
- awq - 4-bit quantization (requires AWQ-quantized model)
- gptq - 4-bit quantization (requires GPTQ-quantized model)
- fp8 - 8-bit floating point quantization
- squeezellm / marlin / gptq_marlin - Other quantization methods
Requires compatible pre-quantized model. Reduces memory usage significantly with minimal quality loss.
GPU Memory Utilization
GPU Memory Utilization (Number, Default: 0.9): Fraction of GPU memory to use (0.0-1.0).
Default 0.9 reserves 10% for other processes. Lower values (0.7-0.8) leave more memory for concurrent workloads.
Max Model Length
Max Model Length (Number, Optional): Maximum model context length in tokens.
Limits the maximum input sequence length. Minimum value is 1.
Max Number of Sequences
Max Number of Sequences (Number, Optional): Maximum sequences per iteration.
Controls batch processing capacity. Higher values increase throughput but require more memory. Minimum value is 1.
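When tuning the settings above, it can help to know the vLLM engine arguments they roughly correspond to. The sketch below shows one plausible mapping using vLLM's `LLM` constructor (illustrative values assuming two GPUs; the node's internal wiring may differ):

```python
# Plausible mapping from node settings to vLLM engine arguments (illustrative;
# the node's internal wiring may differ).
from vllm import LLM

llm = LLM(
    model="BAAI/bge-large-en-v1.5",  # Model
    task="embed",                    # Task Type
    tensor_parallel_size=2,          # Tensor Parallel Size (assumes 2 GPUs)
    dtype="float16",                 # Data Type
    gpu_memory_utilization=0.8,      # GPU Memory Utilization
    max_model_len=512,               # Max Model Length
    max_num_seqs=128,                # Max Number of Sequences
    # quantization="awq",            # Quantization: requires a pre-quantized AWQ model
)
```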
Tokenizer
Tokenizer (Text, Optional): Tokenizer name or path.
Defaults to the model's tokenizer if not specified. Use when a different tokenizer is required.
Tokenizer Mode
Tokenizer Mode (Dropdown, Optional): Tokenizer loading mode.
Controls how the tokenizer is loaded and initialized.
Trust Remote Code
Trust Remote Code (Toggle, Default: true): Allow execution of custom code from HuggingFace.
Required for some models but has security implications. Only leave this enabled if you trust the model source.
Model Revision
Model Revision (Text, Optional): Model revision (branch/tag/commit).
Specifies a specific model version from HuggingFace Hub.
Tokenizer Revision
Tokenizer Revision (Text, Optional): Tokenizer revision (branch/tag/commit).
Specifies a specific tokenizer version from HuggingFace Hub.
Random Seed
Random Seed (Number, Optional): Random seed for reproducibility.
Ensures consistent results across runs. Minimum value is 0.
Swap Space
Swap Space (Number, Optional): CPU swap space in gigabytes.
Enables CPU memory swapping for models exceeding GPU memory.
CPU Offload
CPU Offload (Number, Optional): CPU memory for offloading in gigabytes.
Offloads model layers to CPU memory when GPU memory is insufficient.
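These two settings correspond roughly to vLLM's CPU-memory engine arguments, as in the hedged sketch below (the amounts are examples, not recommendations):

```python
# Illustrative use of vLLM's CPU-memory options for models that don't fit
# comfortably in GPU memory; the amounts are examples, not recommendations.
from vllm import LLM

llm = LLM(
    model="intfloat/e5-large-v2",
    task="embed",
    swap_space=8,       # Swap Space: 8 GB of CPU swap space
    cpu_offload_gb=4,   # CPU Offload: offload up to 4 GB of weights to CPU memory
)
```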
Number of GPUs
Number of GPUs (Number, Optional): Number of GPUs to use.
Limits GPU usage when multiple GPUs are available.
Download Directory
Download Directory (Text, Optional): Directory to download and cache models.
Controls where model files are stored. Use fast storage (SSD) for better loading performance.
Load Format
Load Format (Dropdown, Optional): Format for loading model weights.
Specifies how model weights are loaded from disk.
Enforce Eager Execution
Enforce Eager Execution (Toggle, Optional): Enforce eager execution mode.
Disables CUDA graph optimization for debugging or compatibility.
Disable Custom All-Reduce
Disable Custom All-Reduce (Toggle, Optional): Disable custom all-reduce kernel.
Uses standard all-reduce operations instead of optimized kernels.
Enable Prefix Caching
Enable Prefix Caching (Toggle, Optional): Enable automatic prefix caching.
Caches common prompt prefixes for better performance with repeated patterns. Improves throughput when processing similar texts.
Disable Sliding Window
Disable Sliding Window (Toggle, Optional): Disable sliding window attention.
Disables sliding window attention mechanism for models that support it.
Enable LoRA
Enable LoRA (Toggle, Optional): Enable LoRA adapter support.
Allows using fine-tuned model variants through LoRA adapters without loading separate full models.
Max LoRA Adapters
Max LoRA Adapters (Number, Optional): Maximum number of LoRA adapters to load.
Limits concurrent LoRA adapters when Enable LoRA is active.
Max LoRA Rank
Max LoRA Rank (Number, Optional): Maximum LoRA adapter rank.
Limits the rank of LoRA adapters that can be loaded.
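In vLLM terms, these settings roughly correspond to the engine's LoRA arguments. The sketch below shows the shape of such a configuration (the limits are hypothetical, and how the node selects specific adapters may differ):

```python
# Hedged sketch of vLLM's LoRA-related engine arguments; the limits shown are
# hypothetical, and adapter selection may be exposed differently by the node.
from vllm import LLM

llm = LLM(
    model="BAAI/bge-large-en-v1.5",
    task="embed",
    enable_lora=True,   # Enable LoRA
    max_loras=2,        # Max LoRA Adapters
    max_lora_rank=16,   # Max LoRA Rank
)
```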
Common Parameters
This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, and Logging Mode. For detailed information, see Common Parameters.
Best Practices
- Start with default settings (Model only) and add optimizations based on performance requirements
- Use Tensor Parallel Size for models exceeding single GPU memory—set to the number of available GPUs
- Enable Quantization (AWQ or GPTQ) for large models using pre-quantized models from HuggingFace
- Set GPU Memory Utilization to 0.7-0.8 when running multiple models or workflows concurrently
- Enable Prefix Caching for workflows processing similar texts repeatedly
- Use Data Type float16 or bfloat16 for production deployments to balance speed and accuracy
- Configure Download Directory to use fast storage (SSD) for better model loading performance
- Monitor GPU memory usage and adjust Max Number of Sequences to optimize throughput
Limitations
- Local execution only: Models run on the server hosting the workflow engine. Ensure sufficient GPU resources and memory.
- First-run download delay: The first execution with a new model triggers a download from HuggingFace Hub, which may take several minutes to hours depending on model size.
- Text-only support: The node only supports text embeddings. Image embedding requests fail.
- GPU memory requirements: Large models and high Max Number of Sequences can exceed GPU memory limits, causing out-of-memory errors.
- Quantization compatibility: Quantization methods require pre-quantized models. Standard models cannot be quantized on-the-fly.
- Tensor parallelism overhead: Multi-GPU distribution adds communication overhead. Only beneficial for models that don't fit on a single GPU.